Lab Topics

Goals:

After this lab you will be able to:

  • fit and select multiple regression models

  • identify best-fit models

  • interpret coefficients and make predictions using multiple regression models

  • test for multicollinearity and its possible effects

This lab uses materials by

  • Andrew Bray

  • Andrew Bray and Mine Cetinkaya-Rundel

  • Matthew Salganik

Example: Grading the professor

Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” (Hamermesh and Parker, 2005) found that instructors who are viewed to be better looking receive higher instructional ratings. (Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, August 2005, Pages 369-376, ISSN 0272-7757, 10.1016/j.econedurev.2004.07.013. http://www.sciencedirect.com/science/article/pii/S0272775704001165.)

In this lab we will analyze the data from this study in order to learn what goes into a positive professor evaluation.

The data

The data were gathered from end-of-semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance. (This is a slightly modified version of the original data set that was released as part of the replication data for Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill, 2007).) The result is a data frame where each row contains a different course and columns represent variables about the courses and professors.

load("input/evals.RData")
variable         description
score            average professor evaluation score: (1) very unsatisfactory - (5) excellent.
rank             rank of professor: teaching, tenure track, tenured.
ethnicity        ethnicity of professor: not minority, minority.
gender           gender of professor: female, male.
language         language of school where professor received education: english or non-english.
age              age of professor.
cls_perc_eval    percent of students in class who completed evaluation.
cls_did_eval     number of students in class who completed evaluation.
cls_students     total number of students in class.
cls_level        class level: lower, upper.
cls_profs        number of professors teaching sections in course in sample: single, multiple.
cls_credits      number of credits of class: one credit (lab, PE, etc.), multi credit.
bty_f1lower      beauty rating of professor from lower level female: (1) lowest - (10) highest.
bty_f1upper      beauty rating of professor from upper level female: (1) lowest - (10) highest.
bty_f2upper      beauty rating of professor from second upper level female: (1) lowest - (10) highest.
bty_m1lower      beauty rating of professor from lower level male: (1) lowest - (10) highest.
bty_m1upper      beauty rating of professor from upper level male: (1) lowest - (10) highest.
bty_m2upper      beauty rating of professor from second upper level male: (1) lowest - (10) highest.
bty_avg          average beauty rating of professor.
pic_outfit       outfit of professor in picture: not formal, formal.
pic_color        color of professor’s picture: color, black & white.

Exploring the data

  1. Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.

This is an observational study; we did not randomize professors’ “looks”, nor did we randomly assign students to different classes. Thus, we can’t make a causal argument in this case (since we don’t know whether it is beauty that CAUSES better evaluations). Rather, we can restate the question as: “Is there an association between instructor looks and teaching evaluations?” or something to that effect.

  2. Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
hist(evals$score)

This distribution is left skewed; it looks like students generally rate courses pretty highly, with many more scores close to 5 than near the bottom of the scale. This is what I would have expected, for several reasons. For one, there is a bit of self-selection going on here: students opt into classes, and professors are hired at a place like UT on the basis of aptitude, so in both cases we would expect scores to be more likely high than low.

  3. Excluding score, select two other variables and describe their relationship using an appropriate visualization (scatterplot, side-by-side boxplots, or mosaic plot).

I’ll compare the distribution of scores by gender.

plot(evals$score~evals$gender)

It looks like there isn’t much difference; male professors are rated slightly higher on average, but the primary difference seems to be a heavier lower tail for female professors (i.e., a few more low scores).

Simple linear regression

The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:

library(ggplot2)
ggplot(evals,aes(y=score,x=bty_avg)) + geom_point()

Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?

dim(evals)
## [1] 463  21

Hrrm, I don’t see 463 points!

  4. Replot the scatterplot, but this time use the function geom_jitter() on the \(y\)- or the \(x\)-coordinate. (Use ?geom_jitter to learn more.) What was misleading about the initial scatterplot?
p = ggplot(evals,aes(y=score,x=bty_avg)) + geom_jitter()
p

  5. Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating and add the line to your plot using abline(m_bty). Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
m_bty = lm(score~bty_avg,data=evals)
p + geom_abline(aes(slope=m_bty$coef[2],intercept=m_bty$coef[1]))
## Warning in data.frame(slope = structure(0.0666370370198145, .Names =
## "bty_avg"), : row names were found from a short variable and have been
## discarded

  6. Use residual plots to evaluate whether the conditions of least squares regression are reasonable.
plot(fortify(m_bty)$.resid ~ fortify(m_bty)$.fitted)

qqnorm(residuals(m_bty));
qqline(residuals(m_bty), col = 2)

hist(residuals(m_bty),breaks = 20)

Here you can see that the residuals aren’t really normally distributed. Probably close enough (maybe?!?) in this case, but not a perfect fit.

plot(rstandard(m_bty)~fitted(m_bty),ylim=c(-4,4))
abline(h=2,lty=2)
abline(h=-2,lty=2)
abline(h=0,lty=2)

There doesn’t appear to be much heteroskedasticity, but we do see a lot of high-magnitude residuals on the low side.
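The visual check can be complemented with a rough numeric one. A hedged sketch on simulated data (simulated so the snippet stands alone; the variable names are illustrative): regress the squared residuals on the fitted values, and a clearly nonzero slope suggests non-constant variance. This is the idea behind the Breusch-Pagan test.

```r
# Numeric heteroskedasticity check (sketch): regress squared residuals on
# fitted values; a significant slope flags non-constant variance.
set.seed(3)
x <- runif(200)
y <- 1 + x + rnorm(200, sd = 0.5 + 2 * x)   # noise grows with x by construction
m <- lm(y ~ x)
aux <- lm(residuals(m)^2 ~ fitted(m))
summary(aux)$coefficients[2, "Pr(>|t|)"]    # small p-value: variance is not constant
```

For the m_bty model, the same auxiliary regression could be run on residuals(m_bty) and fitted(m_bty).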

plot(rstandard(m_bty)~evals$bty_avg)

By plotting the residuals against the explanatory variable, we can look for remaining patterns such as nonlinearity or autocorrelation. In this case it looks pretty good: the values hop around randomly about 0, and we don’t see runs of positive or negative residuals.

Multiple linear regression

The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let’s take a look at the relationship between one of these scores and the average beauty score.

plot(evals$bty_avg ~ evals$bty_f1lower)

cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112

As expected the relationship is quite strong - after all, the average score is calculated using the individual scores. We can actually take a look at the relationships between all beauty variables (columns 13 through 19) using the following command:

plot(evals[,13:19])

These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.
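To see why adding a nearly collinear predictor "would not add much value", here is a hedged sketch on simulated data (names are illustrative): the coefficient's standard error inflates badly, and a hand-rolled variance inflation factor quantifies the problem.

```r
# Sketch: collinearity inflates coefficient standard errors (simulated data)
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.05)            # nearly identical to x1
y  <- 1 + x1 + rnorm(200)
se_alone <- summary(lm(y ~ x1))$coef["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2))$coef["x1", "Std. Error"]
c(se_alone, se_both)                        # the second is far larger
1 / (1 - summary(lm(x1 ~ x2))$r.squared)    # hand-rolled variance inflation factor
```

The same hand-rolled VIF could be computed for any of the bty_ variables against the others.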

In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add the gender term into the model.

m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07
  7. Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg?

Yes, bty_avg is still a significant predictor, since the p-value for its t-test is very low. Adding gender did change the parameter estimate slightly: the slope on bty_avg rose from about 0.067 in the simple model to about 0.074 here.

Note that the estimate for gender is now called gendermale. You’ll see this name change whenever you introduce a categorical variable. The reason is that R recodes gender from having the values of female and male to being an indicator variable called gendermale that takes a value of \(0\) for females and a value of \(1\) for males. (Such variables are often referred to as “dummy” variables.) The decision to call the indicator variable gendermale instead of genderfemale has no deeper meaning. R simply codes the category that comes first alphabetically as a \(0\). (You can change the reference level of a categorical variable, which is the level that is coded as a 0, using the relevel function. Use ?relevel to learn more.)

As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.

\[ \begin{aligned} \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (0) \\ &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\end{aligned} \]
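A minimal sketch of this recoding on a toy factor (the data here are illustrative, not from evals):

```r
# How R turns a two-level factor into a 0/1 "dummy" column, and how
# relevel() changes which level is coded 0 (toy data, not the evals set)
gender <- factor(c("female", "male", "male", "female"))
model.matrix(~ gender)        # column 'gendermale': female = 0, male = 1
gender2 <- relevel(gender, ref = "male")
model.matrix(~ gender2)       # now column 'gender2female': male = 0, female = 1
```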

We can plot this line and the line corresponding to males with the following custom function.

multiLines(m_bty_gen)

  8. What is the equation of the line corresponding to males? (Hint: For males, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?

For males, the indicator equals 1, so the equation is \(\widehat{score} = (3.747 + 0.172) + 0.074 \times bty\_avg \approx 3.92 + 0.074 \times bty\_avg\). For two professors with the same beauty rating, the male is predicted to have the higher evaluation score, by about 0.17 points on average.

  9. Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured.
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05

The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for bty_avg reflects how much higher a group of professors is expected to score if they have a beauty rating that is one point higher while holding all other variables constant. In this case, that translates into considering only professors of the same rank with bty_avg scores that are one point apart.
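This “holding all other variables constant” reading can be checked directly with predict(). A sketch on simulated data (variable names are illustrative, not from evals): the difference in predictions for a one-unit change in one predictor, with the other held fixed, equals that predictor’s coefficient.

```r
# Sketch: a multiple-regression coefficient is the change in the prediction
# when its predictor moves one unit and the others are held fixed
set.seed(7)
d <- data.frame(x = rnorm(100), g = gl(2, 50, labels = c("a", "b")))
d$y <- 1 + 0.5 * d$x + 0.3 * (d$g == "b") + rnorm(100)
m <- lm(y ~ x + g, data = d)
p_lo <- predict(m, data.frame(x = 2, g = "a"))
p_hi <- predict(m, data.frame(x = 3, g = "a"))   # x one unit higher, g held fixed
p_hi - p_lo                                      # matches coef(m)["x"] up to rounding
```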

Model Selection

What do we mean by “best”?

While we’d like to find the “true” model, in practice we just hope we’re doing a good job at:

  1. Prediction
  2. Description

Selecting covariates

Searching the model space

For a given data set with many potential predictors, there exist many possible models. How should we systematically evaluate those models?

Model search strategies

  1. All best subsets
  2. Backwards elimination
  3. Forward selection
  • Note that each method may choose a different model!
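To see why a systematic strategy matters: with \(p\) candidate predictors there are \(2^p\) possible subsets (ignoring interactions), which grows quickly.

```r
# Number of possible predictor subsets grows exponentially with p
p <- c(5, 10, 20)
2^p   # 32 models for five predictors; over a million for twenty
```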

Example: Bridge Building

The data

bridge <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/bridge.txt", header=TRUE)
head(bridge)
##   Case  Time DArea CCost Dwgs Length Spans
## 1    1  78.8  3.60  82.4    6     90     1
## 2    2 309.5  5.33 422.3   12    126     2
## 3    3 184.5  6.29 179.8    9     78     1
## 4    4  69.6  2.20 100.0    5     60     1
## 5    5  68.8  1.44 103.0    5     60     1
## 6    6  95.7  5.40 134.4    5     60     1

All best subsets

logDArea <- log(bridge$DArea)
logCCost <- log(bridge$CCost)
logDwgs <- log(bridge$Dwgs)
logLength <- log(bridge$Length)
logSpans <- log(bridge$Spans)
X <- cbind(logDArea,logCCost,logDwgs,logLength,logSpans)
library(leaps)
b <- regsubsets(as.matrix(X), log(bridge$Time))
summary(b)$outmat
##          logDArea logCCost logDwgs logLength logSpans
## 1  ( 1 ) " "      " "      "*"     " "       " "     
## 2  ( 1 ) " "      " "      "*"     " "       "*"     
## 3  ( 1 ) " "      "*"      "*"     " "       "*"     
## 4  ( 1 ) "*"      "*"      "*"     " "       "*"     
## 5  ( 1 ) "*"      "*"      "*"     "*"       "*"

Build best models

summary(b)$outmat
##          logDArea logCCost logDwgs logLength logSpans
## 1  ( 1 ) " "      " "      "*"     " "       " "     
## 2  ( 1 ) " "      " "      "*"     " "       "*"     
## 3  ( 1 ) " "      "*"      "*"     " "       "*"     
## 4  ( 1 ) "*"      "*"      "*"     " "       "*"     
## 5  ( 1 ) "*"      "*"      "*"     "*"       "*"
m1 <- lm(log(bridge$Time) ~ logDwgs)
m2 <- lm(log(bridge$Time) ~ logDwgs + logSpans)
m3 <- lm(log(bridge$Time) ~ logDwgs + logSpans + logCCost)
m4 <- lm(log(bridge$Time) ~ logDwgs + logSpans + logCCost + logDArea)
m5 <- lm(log(bridge$Time) ~ logDwgs + logSpans + logCCost + logDArea + logLength)
models <- list(m1, m2, m3, m4, m5)

Compare best models

summary(m2)$coef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 2.6617320  0.2687132 9.905477 1.488686e-12
## logDwgs     1.0416321  0.1541992 6.755105 3.259358e-08
## logSpans    0.2853049  0.0909484 3.136997 3.115580e-03
summary(m3)$coef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 2.3316935  0.3576635 6.519237 7.901762e-08
## logDwgs     0.8355863  0.2135074 3.913618 3.356038e-04
## logSpans    0.1962899  0.1107299 1.772692 8.370984e-02
## logCCost    0.1482750  0.1074828 1.379522 1.752120e-01

F-tests for comparison

Two nested models may be compared using the anova() function. The anova() command computes analysis of variance (or deviance) tables. When given one model as an argument, it displays the ANOVA table. When two (or more) nested models are given, it calculates the differences between them.

anova(m2,m3)
## Analysis of Variance Table
## 
## Model 1: log(bridge$Time) ~ logDwgs + logSpans
## Model 2: log(bridge$Time) ~ logDwgs + logSpans + logCCost
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     42 4.0488                           
## 2     41 3.8693  1    0.1796 1.9031 0.1752
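The F statistic in this table can be reproduced by hand from the printed RSS values and residual degrees of freedom:

```r
# Reproduce the nested-model F test from the anova() output above
rss_r <- 4.0488; df_r <- 42   # restricted model (m2)
rss_f <- 3.8693; df_f <- 41   # full model (m3)
F_stat <- ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
F_stat                                              # ~1.90
pf(F_stat, df_r - df_f, df_f, lower.tail = FALSE)   # ~0.175
```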

You can also conduct a joint test of significance for all model parameters. The F-statistic shown in the model summary output is the result of a test comparing your model to the completely restricted, intercept-only model. In other words, it tests whether all non-intercept model parameters are equal to zero. Rejecting this hypothesis does not mean that all variables are significant; rather, it means that at least one parameter is not equal to zero.

summary(m2)
## 
## Call:
## lm(formula = log(bridge$Time) ~ logDwgs + logSpans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68649 -0.24728 -0.05988  0.26050  0.63759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.66173    0.26871   9.905 1.49e-12 ***
## logDwgs      1.04163    0.15420   6.755 3.26e-08 ***
## logSpans     0.28530    0.09095   3.137  0.00312 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3105 on 42 degrees of freedom
## Multiple R-squared:  0.7642, Adjusted R-squared:  0.753 
## F-statistic: 68.08 on 2 and 42 DF,  p-value: 6.632e-14
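As a check, the overall F statistic can be recovered from the printed \(R^2\) via \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\), where \(k\) is the number of predictors:

```r
# Recover the overall F statistic from R^2 (values from summary(m2) above)
r2 <- 0.7642; k <- 2; n <- 45        # n - k - 1 = 42 residual df
(r2 / k) / ((1 - r2) / (n - k - 1))  # ~68.1, matching the reported F-statistic
```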

The function drop1() computes a table of changes in fit for each term in the named linear model object.

drop1(m3)
## Single term deletions
## 
## Model:
## log(bridge$Time) ~ logDwgs + logSpans + logCCost
##          Df Sum of Sq    RSS      AIC
## <none>                3.8693 -102.412
## logDwgs   1   1.44544 5.3147  -90.128
## logSpans  1   0.29656 4.1658 -101.089
## logCCost  1   0.17960 4.0488 -102.370

log-likelihood

You can also compare models using the log-likelihood; AIC and BIC scores are based upon the log-likelihood. The formula for AIC is:

\[ AIC = -2 * logLikelihood + 2 * npar\]

and for BIC:

\[ BIC = -2 * loglikelihood + log(n) * npar\]

Note that BIC differs from AIC in that its penalty grows with the total number of observations \(n\).

-2 * as.numeric(logLik(m2)) + 2 * attr(logLik(m2), "df") # this...
## [1] 27.33412
AIC(m2) # should equal this...
## [1] 27.33412

(Here npar is taken from attr(logLik(m2), "df"), which counts the three regression coefficients plus the error variance; counting only the model terms would miss the variance parameter.)
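The same arithmetic can be checked for BIC as well. A self-contained sketch on simulated data (simulated so it does not depend on the downloaded bridge data):

```r
# Verify the AIC and BIC formulas by hand on a small simulated regression
set.seed(1)
x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
m <- lm(y ~ x)
ll   <- as.numeric(logLik(m))
npar <- attr(logLik(m), "df")         # coefficients plus the error variance
c(-2 * ll + 2 * npar - AIC(m),        # 0: matches R's AIC
  -2 * ll + log(50) * npar - BIC(m))  # 0: matches R's BIC
```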

Comparing best models

The two- and three-predictor models both do well, but the two-predictor model is preferred because all of its predictors are statistically significant.

Be aware that when looking at p-values across many candidate models, the burden of proof should be much higher. That is, we need very low p-values to be convinced of statistical significance.

Backward elimination

backAIC <- step(m5, direction = "backward", data = bridge)
## Start:  AIC=-98.71
## log(bridge$Time) ~ logDwgs + logSpans + logCCost + logDArea + 
##     logLength
## 
##             Df Sum of Sq    RSS      AIC
## - logLength  1   0.00607 3.8497 -100.640
## - logDArea   1   0.01278 3.8564 -100.562
## <none>                   3.8436  -98.711
## - logCCost   1   0.18162 4.0252  -98.634
## - logSpans   1   0.26616 4.1098  -97.698
## - logDwgs    1   1.45358 5.2972  -86.277
## 
## Step:  AIC=-100.64
## log(bridge$Time) ~ logDwgs + logSpans + logCCost + logDArea
## 
##            Df Sum of Sq    RSS      AIC
## - logDArea  1   0.01958 3.8693 -102.412
## <none>                  3.8497 -100.640
## - logCCost  1   0.18064 4.0303 -100.577
## - logSpans  1   0.31501 4.1647  -99.101
## - logDwgs   1   1.44946 5.2991  -88.260
## 
## Step:  AIC=-102.41
## log(bridge$Time) ~ logDwgs + logSpans + logCCost
## 
##            Df Sum of Sq    RSS      AIC
## <none>                  3.8693 -102.412
## - logCCost  1   0.17960 4.0488 -102.370
## - logSpans  1   0.29656 4.1658 -101.089
## - logDwgs   1   1.44544 5.3147  -90.128
backBIC <- step(m5, direction = "backward", data = bridge, k = log(nrow(bridge)))
## Start:  AIC=-87.87
## log(bridge$Time) ~ logDwgs + logSpans + logCCost + logDArea + 
##     logLength
## 
##             Df Sum of Sq    RSS     AIC
## - logLength  1   0.00607 3.8497 -91.607
## - logDArea   1   0.01278 3.8564 -91.529
## - logCCost   1   0.18162 4.0252 -89.600
## - logSpans   1   0.26616 4.1098 -88.665
## <none>                   3.8436 -87.871
## - logDwgs    1   1.45358 5.2972 -77.244
## 
## Step:  AIC=-91.61
## log(bridge$Time) ~ logDwgs + logSpans + logCCost + logDArea
## 
##            Df Sum of Sq    RSS     AIC
## - logDArea  1   0.01958 3.8693 -95.185
## - logCCost  1   0.18064 4.0303 -93.350
## - logSpans  1   0.31501 4.1647 -91.874
## <none>                  3.8497 -91.607
## - logDwgs   1   1.44946 5.2991 -81.034
## 
## Step:  AIC=-95.19
## log(bridge$Time) ~ logDwgs + logSpans + logCCost
## 
##            Df Sum of Sq    RSS     AIC
## - logCCost  1   0.17960 4.0488 -96.950
## - logSpans  1   0.29656 4.1658 -95.669
## <none>                  3.8693 -95.185
## - logDwgs   1   1.44544 5.3147 -84.708
## 
## Step:  AIC=-96.95
## log(bridge$Time) ~ logDwgs + logSpans
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  4.0488 -96.950
## - logSpans  1    0.9487 4.9975 -91.284
## - logDwgs   1    4.3989 8.4478 -67.661

Forward selection

mint <- lm(log(Time) ~ 1, data = bridge)
forwardAIC <- step(mint,
                   scope = list(lower = ~1, upper = ~log(DArea) + log(CCost) + 
                                  log(Dwgs) + log(Length) + log(Spans)), 
                   direction = "forward", data = bridge)
## Start:  AIC=-41.35
## log(Time) ~ 1
## 
##               Df Sum of Sq     RSS     AIC
## + log(Dwgs)    1   12.1765  4.9975 -94.898
## + log(CCost)   1   11.6147  5.5593 -90.104
## + log(DArea)   1   10.2943  6.8797 -80.514
## + log(Length)  1   10.0120  7.1620 -78.704
## + log(Spans)   1    8.7262  8.4478 -71.274
## <none>                     17.1740 -41.347
## 
## Step:  AIC=-94.9
## log(Time) ~ log(Dwgs)
## 
##               Df Sum of Sq    RSS      AIC
## + log(Spans)   1   0.94866 4.0488 -102.370
## + log(CCost)   1   0.83170 4.1658 -101.089
## + log(Length)  1   0.66914 4.3284  -99.366
## + log(DArea)   1   0.47568 4.5218  -97.399
## <none>                     4.9975  -94.898
## 
## Step:  AIC=-102.37
## log(Time) ~ log(Dwgs) + log(Spans)
## 
##               Df Sum of Sq    RSS     AIC
## + log(CCost)   1  0.179598 3.8693 -102.41
## <none>                     4.0488 -102.37
## + log(DArea)   1  0.018535 4.0303 -100.58
## + log(Length)  1  0.016924 4.0319 -100.56
## 
## Step:  AIC=-102.41
## log(Time) ~ log(Dwgs) + log(Spans) + log(CCost)
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     3.8693 -102.41
## + log(DArea)   1  0.019578 3.8497 -100.64
## + log(Length)  1  0.012868 3.8564 -100.56
forwardBIC <- step(mint,
                   scope = list(lower = ~1, upper = ~log(DArea) + log(CCost) + 
                                  log(Dwgs) + log(Length) + log(Spans)),
                   direction = "forward", data = bridge,k = log(nrow(bridge)))
## Start:  AIC=-39.54
## log(Time) ~ 1
## 
##               Df Sum of Sq     RSS     AIC
## + log(Dwgs)    1   12.1765  4.9975 -91.284
## + log(CCost)   1   11.6147  5.5593 -86.490
## + log(DArea)   1   10.2943  6.8797 -76.901
## + log(Length)  1   10.0120  7.1620 -75.091
## + log(Spans)   1    8.7262  8.4478 -67.661
## <none>                     17.1740 -39.540
## 
## Step:  AIC=-91.28
## log(Time) ~ log(Dwgs)
## 
##               Df Sum of Sq    RSS     AIC
## + log(Spans)   1   0.94866 4.0488 -96.950
## + log(CCost)   1   0.83170 4.1658 -95.669
## + log(Length)  1   0.66914 4.3284 -93.946
## + log(DArea)   1   0.47568 4.5218 -91.979
## <none>                     4.9975 -91.284
## 
## Step:  AIC=-96.95
## log(Time) ~ log(Dwgs) + log(Spans)
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     4.0488 -96.950
## + log(CCost)   1  0.179598 3.8693 -95.185
## + log(DArea)   1  0.018535 4.0303 -93.350
## + log(Length)  1  0.016924 4.0319 -93.332

Stepwise compared

Backward Elimination

Optimal model based on AIC included log(CCost), log(Dwgs), and log(Spans). Optimal model based on BIC included only log(Dwgs) and log(Spans).

Forward Selection

Using AIC, the optimal model is the same as by backward AIC. Using BIC, the optimal model is the same as backward BIC.

We have the same choice between the 2 and 3 predictor models.

Back to professor example:

We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color.

  10. Which variable would you expect to have the highest p-value in this model? Why? Hint: Think about which variable you would expect to have no association with the professor score.

Let’s run the model…

m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
    cls_students + cls_level + cls_profs + cls_credits + bty_avg + pic_outfit + 
    pic_color, data = evals)
summary(m_full)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77397 -0.32432  0.09067  0.35183  0.95036 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.0952141  0.2905277  14.096  < 2e-16 ***
## ranktenure track      -0.1475932  0.0820671  -1.798  0.07278 .  
## ranktenured           -0.0973378  0.0663296  -1.467  0.14295    
## ethnicitynot minority  0.1234929  0.0786273   1.571  0.11698    
## gendermale             0.2109481  0.0518230   4.071 5.54e-05 ***
## languagenon-english   -0.2298112  0.1113754  -2.063  0.03965 *  
## age                   -0.0090072  0.0031359  -2.872  0.00427 ** 
## cls_perc_eval          0.0053272  0.0015393   3.461  0.00059 ***
## cls_students           0.0004546  0.0003774   1.205  0.22896    
## cls_levelupper         0.0605140  0.0575617   1.051  0.29369    
## cls_profssingle       -0.0146619  0.0519885  -0.282  0.77806    
## cls_creditsone credit  0.5020432  0.1159388   4.330 1.84e-05 ***
## bty_avg                0.0400333  0.0175064   2.287  0.02267 *  
## pic_outfitnot formal  -0.1126817  0.0738800  -1.525  0.12792    
## pic_colorcolor        -0.2172630  0.0715021  -3.039  0.00252 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared:  0.1871, Adjusted R-squared:  0.1617 
## F-statistic: 7.366 on 14 and 448 DF,  p-value: 6.552e-14

  11. Check your suspicions from the previous exercise. Include the model output in your response.

  12. Interpret the coefficient associated with the ethnicity variable.

Non-minority professors are predicted to receive eval scores that are 0.12 units higher on average than minority professors, holding all other variables constant.

  13. Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?
m_drop_cls_profs <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level  + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
summary(m_drop_cls_profs)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7836 -0.3257  0.0859  0.3513  0.9551 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.0872523  0.2888562  14.150  < 2e-16 ***
## ranktenure track      -0.1476746  0.0819824  -1.801 0.072327 .  
## ranktenured           -0.0973829  0.0662614  -1.470 0.142349    
## ethnicitynot minority  0.1274458  0.0772887   1.649 0.099856 .  
## gendermale             0.2101231  0.0516873   4.065 5.66e-05 ***
## languagenon-english   -0.2282894  0.1111305  -2.054 0.040530 *  
## age                   -0.0089992  0.0031326  -2.873 0.004262 ** 
## cls_perc_eval          0.0052888  0.0015317   3.453 0.000607 ***
## cls_students           0.0004687  0.0003737   1.254 0.210384    
## cls_levelupper         0.0606374  0.0575010   1.055 0.292200    
## cls_creditsone credit  0.5061196  0.1149163   4.404 1.33e-05 ***
## bty_avg                0.0398629  0.0174780   2.281 0.023032 *  
## pic_outfitnot formal  -0.1083227  0.0721711  -1.501 0.134080    
## pic_colorcolor        -0.2190527  0.0711469  -3.079 0.002205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4974 on 449 degrees of freedom
## Multiple R-squared:  0.187,  Adjusted R-squared:  0.1634 
## F-statistic: 7.943 on 13 and 449 DF,  p-value: 2.336e-14
  14. Conduct an F-test comparing the two models from the previous problem (i.e., the unrestricted model and the one where you dropped the variable with the highest p-value). Report your results in a complete sentence.
anova(m_full,m_drop_cls_profs)
## Analysis of Variance Table
## 
## Model 1: score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_profs + cls_credits + bty_avg + 
##     pic_outfit + pic_color
## Model 2: score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_credits + bty_avg + pic_outfit + 
##     pic_color
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    448 111.08                           
## 2    449 111.11 -1 -0.019722 0.0795 0.7781

Since \(p > 0.05\) (much higher, actually), we fail to reject the null hypothesis that the cls_profs coefficient is zero. In other words, adding cls_profs does not significantly improve the model, so the restricted model (without cls_profs) is preferred.

  15. Using backward-selection and p-value as the selection criterion, determine the best model. Write out the linear model for predicting score based on the final model you settle on.
backBIC <- step(m_full, direction = "backward", data = evals, k = log(nrow(evals))) # step() with the BIC penalty, used here in place of manual p-value elimination
## Start:  AIC=-568.84
## score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_profs + cls_credits + bty_avg + 
##     pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - rank           2    0.8930 111.98 -577.40
## - cls_profs      1    0.0197 111.11 -574.89
## - cls_level      1    0.2740 111.36 -573.83
## - cls_students   1    0.3599 111.44 -573.48
## - pic_outfit     1    0.5768 111.66 -572.58
## - ethnicity      1    0.6117 111.70 -572.43
## - language       1    1.0557 112.14 -570.59
## - bty_avg        1    1.2967 112.38 -569.60
## <none>                       111.08 -568.84
## - age            1    2.0456 113.13 -566.52
## - pic_color      1    2.2893 113.37 -565.53
## - cls_perc_eval  1    2.9698 114.06 -562.76
## - gender         1    4.1085 115.19 -558.16
## - cls_credits    1    4.6495 115.73 -555.99
## 
## Step:  AIC=-577.4
## score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_profs + cls_credits + bty_avg + 
##     pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - cls_profs      1    0.0207 112.00 -583.46
## - cls_level      1    0.1943 112.17 -582.74
## - cls_students   1    0.3012 112.28 -582.30
## - pic_outfit     1    0.3797 112.36 -581.97
## - ethnicity      1    0.9016 112.88 -579.83
## - language       1    1.3048 113.28 -578.18
## <none>                       111.98 -577.40
## - bty_avg        1    1.6136 113.59 -576.92
## - age            1    1.6479 113.63 -576.78
## - pic_color      1    2.0335 114.01 -575.21
## - cls_perc_eval  1    3.0707 115.05 -571.02
## - gender         1    3.9584 115.94 -567.46
## - cls_credits    1    6.0651 118.04 -559.12
## 
## Step:  AIC=-583.46
## score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_credits + bty_avg + pic_outfit + 
##     pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - cls_level      1    0.1952 112.19 -588.79
## - cls_students   1    0.3285 112.33 -588.24
## - pic_outfit     1    0.3592 112.36 -588.11
## - ethnicity      1    0.9841 112.98 -585.54
## - language       1    1.2919 113.29 -584.28
## <none>                       112.00 -583.46
## - bty_avg        1    1.6029 113.60 -583.01
## - age            1    1.6436 113.64 -582.85
## - pic_color      1    2.0869 114.08 -581.05
## - cls_perc_eval  1    3.0501 115.05 -577.15
## - gender         1    3.9385 115.94 -573.59
## - cls_credits    1    6.2658 118.26 -564.39
## 
## Step:  AIC=-588.79
## score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_credits + bty_avg + pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - cls_students   1    0.2211 112.42 -594.01
## - pic_outfit     1    0.4463 112.64 -593.09
## - language       1    1.1646 113.36 -590.14
## - ethnicity      1    1.1727 113.37 -590.11
## <none>                       112.19 -588.79
## - age            1    1.5573 113.75 -588.54
## - bty_avg        1    1.6584 113.85 -588.13
## - pic_color      1    1.8933 114.09 -587.18
## - cls_perc_eval  1    3.1191 115.31 -582.23
## - gender         1    3.8038 116.00 -579.49
## - cls_credits    1    6.2118 118.41 -569.97
## 
## Step:  AIC=-594.01
## score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - pic_outfit     1    0.7141 113.13 -597.22
## - ethnicity      1    1.1790 113.59 -595.32
## - language       1    1.3403 113.75 -594.66
## <none>                       112.42 -594.01
## - age            1    1.6847 114.10 -593.26
## - pic_color      1    1.7841 114.20 -592.86
## - bty_avg        1    1.8553 114.27 -592.57
## - cls_perc_eval  1    2.9147 115.33 -588.30
## - gender         1    4.0577 116.47 -583.73
## - cls_credits    1    6.1208 118.54 -575.60
## 
## Step:  AIC=-597.22
## score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - language       1    0.9862 114.11 -599.34
## - ethnicity      1    1.2393 114.37 -598.31
## - age            1    1.3350 114.46 -597.93
## <none>                       113.13 -597.22
## - pic_color      1    1.9951 115.12 -595.26
## - bty_avg        1    2.2663 115.39 -594.17
## - cls_perc_eval  1    2.6225 115.75 -592.75
## - gender         1    4.2526 117.38 -586.27
## - cls_credits    1    5.8690 119.00 -579.94
## 
## Step:  AIC=-599.34
## score ~ ethnicity + gender + age + cls_perc_eval + cls_credits + 
##     bty_avg + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - age            1    1.3296 115.44 -600.11
## <none>                       114.11 -599.34
## - pic_color      1    1.6282 115.74 -598.92
## - ethnicity      1    2.3190 116.43 -596.16
## - bty_avg        1    2.3927 116.51 -595.87
## - cls_perc_eval  1    2.6951 116.81 -594.67
## - gender         1    4.0406 118.16 -589.37
## - cls_credits    1    6.4797 120.59 -579.91
## 
## Step:  AIC=-600.11
## score ~ ethnicity + gender + cls_perc_eval + cls_credits + bty_avg + 
##     pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - pic_color      1    1.2631 116.71 -601.21
## <none>                       115.44 -600.11
## - ethnicity      1    2.2117 117.66 -597.46
## - cls_perc_eval  1    2.8965 118.34 -594.78
## - gender         1    3.1203 118.56 -593.90
## - bty_avg        1    3.9452 119.39 -590.69
## - cls_credits    1    6.7615 122.21 -579.90
## 
## Step:  AIC=-601.21
## score ~ ethnicity + gender + cls_perc_eval + cls_credits + bty_avg
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       116.71 -601.21
## - gender         1    2.7053 119.41 -596.74
## - ethnicity      1    2.7477 119.46 -596.58
## - cls_perc_eval  1    3.3244 120.03 -594.35
## - bty_avg        1    5.5674 122.28 -585.77
## - cls_credits    1    6.8241 123.53 -581.04
backBIC
## 
## Call:
## lm(formula = score ~ ethnicity + gender + cls_perc_eval + cls_credits + 
##     bty_avg, data = evals)
## 
## Coefficients:
##           (Intercept)  ethnicitynot minority             gendermale  
##              3.137381               0.233794               0.157832  
##         cls_perc_eval  cls_creditsone credit                bty_avg  
##              0.005208               0.541067               0.073644

So, our best fit model (using BIC as the criterion) is:

\[\widehat{\mbox{score}} = 3.14 + 0.23 \times \mbox{not minority} + 0.16 \times \mbox{male} + 0.005 \times \mbox{cls\_perc\_eval} + 0.54 \times \mbox{one credit} + 0.07 \times \mbox{bty\_avg}\]

  1. Based on your final model, describe the characteristics of a professor and course at the University of Texas at Austin that would be associated with a high evaluation score.

  2. Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?

Regression Planes

Last week, we examined bivariate regression in part by fitting the regression line to a scatter plot of x and y values. We can’t do exactly the same thing with multiple regression, because there may be any number of different x variables (i.e., \(x_1, x_2, x_3, \ldots, x_k\)). To make things concrete, we will talk about the relationship between education, father’s education, and income using data from the GSS:

\[\mbox{income}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i + \epsilon_i\]

The code below reads in the data and cleans a few things up.

suppressPackageStartupMessages(library(dplyr))
load("input/gss_2010_training.RData")
gss.training <- tbl_df(gss.training)
gss <- select(gss.training, income06_n, educ, maeduc, paeduc) %>%
  filter(!is.na(income06_n), !is.na(educ), !is.na(maeduc), !is.na(paeduc))
# NOTE: DROPPING MISSING DATA LIKE THIS CAN BE DANGEROUS
gss <- dplyr::rename(gss, income = income06_n)

One way to look at these data would be to plot each combination of variables against one another. There’s a nice package called GGally that has an easy built-in function to make this type of plot:

suppressPackageStartupMessages(library(GGally))
pm <- ggpairs(select(gss, educ, paeduc, income))
pm

Another way to think about the relationship amongst multiple variables is in terms of more than 2 dimensions (don’t think beyond 3 dimensions though, you don’t want to hurt yourself!). Load the scatterplot3d package…

library(scatterplot3d)

And then specify a three-dimensional plot using all three variables:

scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income, 
              xlab="Education", ylab="Father's education", 
              zlab="Income category", pch=20, angle=20)

You can angle the plot to get a better look…

scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income, 
              xlab="Education", ylab="Father's education", 
              zlab="Income category", pch=20, angle=40)

scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income, 
              xlab="Education", ylab="Father's education", 
              zlab="Income category", pch=20, angle=80)

Now, instead of a regression line, we will plot a regression plane. Because the model we specify assumes that the effect of each variable is constant net of the other variables, the “slope” (if you will) for each individual variable does not depend, by assumption, on the value of the other variable(s). Thus, we can imagine the regression line (for instance, income ~ education) extending to form a plane across all values of father’s education:

s3d <- scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income, 
              xlab="Education", ylab="Father's education", 
              zlab="Income category", pch=20, angle=80)
my.lm <- lm(gss$income ~ gss$educ + gss$paeduc)
s3d$plane3d(my.lm,col='red',lwd=2)

Again, we can change up the angle to get a different perspective…

s3d <- scatterplot3d(x=gss$educ, y=gss$paeduc, z=gss$income, 
              xlab="Education", ylab="Father's education", 
              zlab="Income category", pch=20, angle=50)
my.lm <- lm(gss$income ~ gss$educ + gss$paeduc)
s3d$plane3d(my.lm,col='red',lwd=2)

The model is: \[\widehat{\mbox{income}}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i\]

Just as with the OLS line for bivariate regression, the regression plane is the plane that minimizes the sum of the squared residuals. The residual is the difference between the actual income and the predicted income for each person in the sample.

\[\mbox{income}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i + \mbox{residual}_i\]

\[\widehat{\mbox{income}}_i = \beta_0 + \beta_1 \times \mbox{educ}_i + \beta_2 \times \mbox{paeduc}_i\]

As with the line, the plane that minimizes the sum of squared residuals is well defined for the whole dataset. But why is that the plane we want?
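To make the “minimizes the sum of squared residuals” claim concrete, here is a small sketch (assuming the gss data frame prepared above): we compare the SSR of the OLS plane to that of a slightly tilted plane, and the tilted one always comes out worse.

```r
# Sketch: the OLS plane minimizes the sum of squared residuals (SSR).
# Assumes the `gss` data frame prepared earlier in this lab.
my.lm <- lm(income ~ educ + paeduc, data = gss)
ssr_ols <- sum(residuals(my.lm)^2)

# Tilt the plane slightly by nudging the education slope:
b <- coef(my.lm)
tilted <- b["(Intercept)"] + (b["educ"] + 0.1) * gss$educ + b["paeduc"] * gss$paeduc
ssr_tilted <- sum((gss$income - tilted)^2)

ssr_ols < ssr_tilted  # TRUE: any other plane has a larger SSR
```

Try nudging the intercept or the paeduc slope instead; the comparison comes out the same way.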

Extrapolation and interpolation in multiple regression

Thankfully, prediction with multiple regression is basically the same as prediction in bivariate regression, albeit with more algebra. Because there can be many terms, you are better off using R’s predict function instead of calculating by hand.

  1. What is the predicted income for someone who has 12 years of education and whose father has 12 years of education? If we just use predict(my.lm), R will return the fitted value for each observation in the data. To make a prediction for a new observation, we need to build a new data frame:
my.lm <- lm(income ~ paeduc + educ, data=gss)
new.observation = data.frame('paeduc' = 12,'educ'=12)
predict.lm(my.lm,newdata = new.observation)
##        1 
## 17.41802
  1. What is the predicted income for someone whose father has 20 years of education and who has 0 years of education?
my.lm <- lm(income ~ paeduc + educ, data=gss)
new.observation = data.frame('paeduc' = 20,'educ'=0)
predict.lm(my.lm,newdata = new.observation)
##        1 
## 10.34044
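Before trusting that second number, it is worth checking whether the combination (20 years of father’s education, 0 years of own education) actually occurs in the data. A quick sketch, again assuming the gss data frame from above:

```r
# Sketch: check whether this prediction is an extrapolation.
# Assumes the `gss` data frame prepared earlier in this lab.
range(gss$paeduc)  # marginal range of father's education
range(gss$educ)    # marginal range of respondent's education

# Even if each value falls inside its marginal range, the *combination*
# may be rare or absent in the data:
sum(gss$paeduc >= 20 & gss$educ == 0)
```

If few or no observations look like this, the prediction leans entirely on the linearity assumption far from where the model was fit.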

Multicollinearity

Example: Car seat position

library(faraway)
data(seatpos)
head(seatpos)
##   Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 1  46    180   187.2 184.9   95.2 36.1  45.3 41.3  -206.300
## 2  31    175   167.5 165.5   83.8 32.9  36.5 35.9  -178.210
## 3  23    100   153.6 152.2   82.9 26.0  36.6 31.0   -71.673
## 4  19    185   190.3 187.4   97.3 37.4  44.1 41.0  -257.720
## 5  23    159   178.0 174.1   93.9 29.5  40.1 36.9  -173.230
## 6  47    170   178.7 177.0   92.4 36.0  43.2 37.4  -185.150

We’ll model hipcenter as a function of all other variables in the dataset…

m1 <- lm(hipcenter ~ ., data = seatpos)
# the dot in the formula notation means 'all other variables'
summary(m1)$coef
##                 Estimate  Std. Error     t value   Pr(>|t|)
## (Intercept) 436.43212823 166.5716187  2.62008697 0.01384361
## Age           0.77571620   0.5703288  1.36012113 0.18427175
## Weight        0.02631308   0.3309704  0.07950283 0.93717877
## HtShoes      -2.69240774   9.7530351 -0.27605845 0.78446097
## Ht            0.60134458  10.1298739  0.05936348 0.95306980
## Seated        0.53375170   3.7618942  0.14188376 0.88815293
## Arm          -1.32806864   3.9001969 -0.34051323 0.73592450
## Thigh        -1.14311888   2.6600237 -0.42974011 0.67056106
## Leg          -6.43904627   4.7138601 -1.36598163 0.18244531

Assessing multicollinearity

We have several tools at our disposal for assessing multicollinearity:

  • Pairs plot: look for strong linear relationships between predictors.
  • Correlation matrix: calculate the pairwise correlations between your predictors using cor().
  • Variance Inflation Factor (VIF): measure how much the variance of each slope estimate is inflated by correlation with the other predictors.

Pairs plot
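A pairs plot can be made with base R’s pairs() (a GGally::ggpairs() call like the one earlier would also work):

```r
library(faraway)  # provides the seatpos data
data(seatpos)

# Scatterplot of every pair of variables; look for tight linear clouds
# among the predictors (e.g., HtShoes vs Ht).
pairs(seatpos)
```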

Correlation matrix

round(cor(seatpos),2)
##             Age Weight HtShoes    Ht Seated   Arm Thigh   Leg hipcenter
## Age        1.00   0.08   -0.08 -0.09  -0.17  0.36  0.09 -0.04      0.21
## Weight     0.08   1.00    0.83  0.83   0.78  0.70  0.57  0.78     -0.64
## HtShoes   -0.08   0.83    1.00  1.00   0.93  0.75  0.72  0.91     -0.80
## Ht        -0.09   0.83    1.00  1.00   0.93  0.75  0.73  0.91     -0.80
## Seated    -0.17   0.78    0.93  0.93   1.00  0.63  0.61  0.81     -0.73
## Arm        0.36   0.70    0.75  0.75   0.63  1.00  0.67  0.75     -0.59
## Thigh      0.09   0.57    0.72  0.73   0.61  0.67  1.00  0.65     -0.59
## Leg       -0.04   0.78    0.91  0.91   0.81  0.75  0.65  1.00     -0.79
## hipcenter  0.21  -0.64   -0.80 -0.80  -0.73 -0.59 -0.59 -0.79      1.00

Pairwise correlations above roughly 0.7 will often induce considerable variance in your slope estimates. Note that Ht and HtShoes are correlated at essentially 1.00; they carry almost exactly the same information.

Variance Inflation Factor

library(car)
## 
## Attaching package: 'car'
## 
## The following objects are masked from 'package:faraway':
## 
##     logit, vif
m1 <- lm(hipcenter ~ ., data = seatpos)
vif(m1)
##        Age     Weight    HtShoes         Ht     Seated        Arm 
##   1.997931   3.647030 307.429378 333.137832   8.951054   4.496368 
##      Thigh        Leg 
##   2.762886   6.694291

Rule of thumb: VIFs greater than 5 should be addressed.
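The VIF has a simple interpretation: for predictor \(j\), \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on all of the other predictors. A sketch, computing the VIF for Ht by hand:

```r
library(faraway)  # provides the seatpos data
data(seatpos)

# Regress Ht on all the other predictors (everything except the response,
# hipcenter), then apply 1 / (1 - R^2):
r2_ht <- summary(lm(Ht ~ . - hipcenter, data = seatpos))$r.squared
1 / (1 - r2_ht)  # about 333, matching vif(m1)["Ht"] above
```

So a VIF of 333 for Ht means the other predictors explain almost all of its variance, leaving very little independent information to estimate its slope.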

Simplifying seatpos

Since most of these variables measure some version of height, we can just use one of them.

m2 <- lm(hipcenter ~ Age + Weight + Ht, data = seatpos)
summary(m2)
## 
## Call:
## lm(formula = hipcenter ~ Age + Weight + Ht, data = seatpos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.526 -23.005   2.164  24.950  53.982 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 528.297729 135.312947   3.904 0.000426 ***
## Age           0.519504   0.408039   1.273 0.211593    
## Weight        0.004271   0.311720   0.014 0.989149    
## Ht           -4.211905   0.999056  -4.216 0.000174 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.49 on 34 degrees of freedom
## Multiple R-squared:  0.6562, Adjusted R-squared:  0.6258 
## F-statistic: 21.63 on 3 and 34 DF,  p-value: 5.125e-08
vif(m2)
##      Age   Weight       Ht 
## 1.093018 3.457681 3.463303

Goal check?

Questions?